Phase 6.3: TCP window management — track guest window, advertise host kernel rcv-space #79
10 bite-sized tasks covering proper TCP windowing:

- `TcpNatEntry` tracks `guest_window` (u32) + `guest_window_scale` (u8)
- `handle_tcp_frame` parses `tcp.window_scale()` on the guest SYN and stores it per-flow; updates `guest_window` on every incoming frame
- `build_tcp_packet_static` signature changes to take `(window_len, window_scale)` — the caller decides
- SYN-ACK negotiates `OUR_WINDOW_SCALE = 7` (passt's default; 128x)
- New `host_recv_window` helper queries `TCP_INFO.tcpi_rcv_space` and scales it for the advertised window on outgoing frames
- `relay_tcp_nat_data` gates host→guest sends on `entry.guest_window` to honor real backpressure
- Three new pins: `tcp_advertised_window_tracks_guest_buffer` (BROKEN_ON_PURPOSE → flips at Task 7), `tcp_window_scale_negotiated_in_synack`, plus the `tcp_bulk_throughput_constrained_window` parametric bench

Severity: MEDIUM — perf gap. The hardcoded `window_len: 65535` caps throughput at 64 KB / RTT regardless of bandwidth, and `inject_to_guest` can grow unbounded if the guest is slow.
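The per-flow tracking in the task list can be sketched as follows — a minimal illustration, not the actual implementation; only the field names come from the PR, the method name is hypothetical:

```rust
// Sketch of the per-flow window state described above. Field names follow
// the task list; `update_guest_window` is an illustrative helper.
struct TcpNatEntry {
    /// Guest's advertised receive window, already scaled into bytes.
    guest_window: u32,
    /// Window-scale shift parsed from the guest's SYN options (0 if absent).
    guest_window_scale: u8,
}

impl TcpNatEntry {
    /// Refreshed on every incoming frame: the raw 16-bit window field is
    /// left-shifted by the scale negotiated on the SYN. With scale 7
    /// (128x), a raw 65535 advertises ~8 MiB.
    fn update_guest_window(&mut self, raw_window: u16) {
        self.guest_window = u32::from(raw_window) << self.guest_window_scale;
    }
}
```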
Adds tcp_bulk_throughput_constrained_window bench that exercises the Task 7 window-gating path under three guest-window sizes (4096, 16384, 65536 bytes). Mirrors tcp_bulk_throughput_1mb with a parametric window so regressions in window-constrained relay show up numerically.
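The window-gating path the bench exercises boils down to this arithmetic — a sketch under assumed names (`window_remaining`, `send_budget` are illustrative, not the real functions):

```rust
// The relay sends at most what the guest's advertised window still allows.
fn window_remaining(guest_window: u32, bytes_in_flight: u32) -> u32 {
    // Saturating: if the guest shrank its window below what is already
    // in flight, the budget is zero rather than an underflow.
    guest_window.saturating_sub(bytes_in_flight)
}

fn send_budget(guest_window: u32, bytes_in_flight: u32, pending: u32) -> u32 {
    pending.min(window_remaining(guest_window, bytes_in_flight))
}
```

With a 4096-byte guest window and 4096 bytes already in flight, the budget is zero and the relay waits for a guest ACK — the backpressure the parametric bench measures.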
Profiling note: tcp_bulk_throughput_1mb regression root-caused

Followed up on the divan +1.6% / VM wall-clock -3.4% throughput regression with a PMU summary.
On-CPU flat hotspots
Conclusion

The throughput regression traces to one syscall, not the data-structure layout.

Proposed follow-up (separate small PR, not blocking this one): cache the `TCP_INFO` result per flow with a short TTL.

```rust
// On TcpNatEntry:
cached_recv_window: u16,
cached_recv_window_at: Instant,

// In the build_tcp_packet_static call sites for data/ACK frames:
const RECV_WINDOW_TTL: Duration = Duration::from_millis(5);
if entry.cached_recv_window_at.elapsed() > RECV_WINDOW_TTL {
    entry.cached_recv_window = host_recv_window(entry.host_stream.as_raw_fd());
    entry.cached_recv_window_at = Instant::now();
}
```

The HashMap-flow-table cache-miss audit is still a worthwhile separate exercise, but the divan/wall-clock regression seen on this PR isn't traceable to it. IPC of 0.78 suggests we're modestly memory-bound elsewhere (likely the smoltcp wire-decode hot path), but the cache-miss rate doesn't indicate pathological structures. Profiles archived locally:
Profiling tcp_bulk_throughput_1mb showed __getsockopt at 5.7% flat CPU — Phase 6.3's host_recv_window was issuing one getsockopt(TCP_INFO) per outgoing TCP frame, costing ~10k syscalls/s at line rate. Cache the result on TcpNatEntry and refresh only every RECV_WINDOW_TTL (5 ms). At line rate this collapses to ~200 syscalls/s — a ~50x reduction — while the advertised window stays within 5 ms of reality, which is well below any realistic RTT. cached_recv_window is initialized at flow construction with one host_recv_window call so the first emitted frame doesn't pay the syscall cost on the data path either.
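A back-of-envelope check of the syscall-rate claims above (the constants restate the numbers from this note; the helper names are illustrative):

```rust
// Pre-fix: one getsockopt(TCP_INFO) per outgoing frame at line rate.
const LINE_RATE_PPS: f64 = 10_000.0;
// Post-fix: at most one refresh per elapsed RECV_WINDOW_TTL (5 ms).
const RECV_WINDOW_TTL_S: f64 = 0.005;

fn post_fix_syscall_rate() -> f64 {
    1.0 / RECV_WINDOW_TTL_S // -> 200 syscalls/s
}

fn syscall_reduction() -> f64 {
    LINE_RATE_PPS / post_fix_syscall_rate() // -> 50x
}
```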
Cache fix landed and re-profiled — regression eliminated

Divan microbenches — before/after the cache fix (vs current `main`)
| Bench | Pre-fix Δ% | Post-fix Δ% | Recovery |
|---|---|---|---|
| `tcp_bulk_throughput_1mb` | +1.6% | +0.1% | regression eliminated |
| `tcp_rx_latency_one_packet` | +6.7% | +2.3% | recovered 4.4 pp |
| `tcp_inbound_syn_ack_transition` | -19.4% | -30.5% | even faster post-fix |
| `process_icmp_echo_request` | +6.1% | +1.9% | recovered 4.2 pp |
| `flow_table_insert_remove/1000` | +5.9% | -2.0% | now better than baseline |
Some flow-construction benches show small regressions (process_syn +4.6%, port_forward_accept_latency +6.1%, process_syn_during_pending_connects/0 +7.2%) — that's the one-time host_recv_window syscall now at flow-creation rather than per-frame. Pay-once-per-flow vs pay-per-packet is the right trade. At line rate (~10k packets/s, ~50 connects/s) this is a >100× syscall reduction.
VM wall-clock — before/after vs current `main`

| Metric | Pre-fix Δ% | Post-fix Δ% |
|---|---|---|
| `tcp_throughput_g2h_mbps` | -3.4% (5942 → 5739) | -0.2% (5776 → 5765) |
| `tcp_rr_latency_us_p50` | -50% | parity (both at 2 µs) |
| `tcp_crr_latency_us_p50` | parity | parity |
PMU — before/after (same 30s capture per side, single bench process)
| Metric | Pre-fix | Post-fix | Δ |
|---|---|---|---|
| IPC | 0.777 | 0.786 | +1.2% |
| Cache Misses / 1K instr | 3.666 | 3.924 | +7.0% (denominator effect) |
| Total Cache Misses (abs) | 86.83 M | 84.36 M | -2.85% |
| Total Instructions | 23.68 B | 21.50 B | -9.2% |
| Total Cycles | 30.47 B | 27.35 B | -10.3% |
| P99.9 on-CPU | 10.17 ms | 9.41 ms | -7.5% |
Total work dropped ~10% (less syscall traffic), IPC improved, and absolute cache misses fell 2.85%. The per-1K-instr rate ticked up because we removed a lot of cache-friendly syscall instructions from the denominator — the remaining mix is slightly more miss-dense but __getsockopt no longer dominates the on-CPU profile.
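The denominator effect is easy to verify from the table's own numbers — a quick arithmetic check, with the miss and instruction counts restated from the PMU table:

```rust
// Misses per 1K instructions: absolute misses fell, but the rate rose
// because total instructions fell faster than total misses.
fn misses_per_1k_instr(misses: f64, instructions: f64) -> f64 {
    misses / instructions * 1000.0
}

// Pre-fix:  86.83 M misses / 23.68 B instr -> ~3.666 per 1K
// Post-fix: 84.36 M misses / 21.50 B instr -> ~3.924 per 1K
```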
On-CPU top-7 — before/after
| Function | Pre-fix flat % | Post-fix flat % |
|---|---|---|
| `handle_tcp_frame` | 26.70% | 25.00% |
| `__libc_recv` (cum) | 29.90% | 35.71% |
| `__libc_send` (cum) | 25.03% | 24.40% |
| `EpollDispatch::wait_with_timeout` | 13.63% | 16.07% |
| `__getsockopt` | 5.70% | — (gone from top-25) |
| `process_guest_frame` | 5.84% | 4.46% |
| `drain_to_guest` | 6.54% | 6.25% |
HashMap cache-miss hypothesis — verdict
At 3.92 cache-misses / 1K instructions (post-fix, well below the 10/1K threshold), the flow-table HashMap does not appear to be a dominant cache-pressure source for tcp_bulk_throughput_1mb. IPC of 0.786 says we're still mildly memory-bound, but it's not localised to the data structure. Hypothesis not confirmed by data on this workload. Worth re-investigating under different workloads (many concurrent flows, different per-entry sizes) but not blocking this PR.
Profiles archived locally:
- pre-fix: `/tmp/p63-bench-{cpu,offcpu,pmu}.{pb.gz,txt}`
- post-fix: `/tmp/p63-fixed-{cpu,offcpu,pmu}.{pb.gz,txt}`
What this branch does
Stops ignoring the guest's advertised TCP window and stops hardcoding our own. Three correctness/perf gaps closed:
- The guest's advertised `window_len` (scaled by `window_scale` from SYN options) is stashed on the flow. `relay_tcp_nat_data` gates `frames_to_inject` on `guest_window - bytes_in_flight`, so the relay stops when the guest's receive buffer is full instead of pretending it's infinite. Phase 3's 256 KB cap was a band-aid for the symptom.
- Outgoing frames advertise `host_recv_window(fd)` (computed from `getsockopt(TCP_INFO).tcpi_rcv_space >> OUR_WINDOW_SCALE`) instead of a hardcoded 65535.
- The SYN-ACK negotiates `window_scale: 7` (matches passt; 128× → 8 MiB max).

Headline win
- No more unbounded `inject_to_guest` queue (Phase 3 capped it at a 256 KB userspace cliff)
- The throughput ceiling is now the guest's real `guest_window` (modern Linux: 4 MB+ scaled)
- The advertised window reflects actual host buffer state via `getsockopt(TCP_INFO).tcpi_rcv_space`

Architecture
- `TcpNatEntry::guest_window: u32` and `guest_window_scale: u8` (`#[serde(default)]` for snapshot back-compat with pre-6.3).
- SYN handling parses the guest's `tcp.window_scale()` option and stashes it; every incoming frame refreshes `entry.guest_window = u32::from(tcp.window_len()) << guest_window_scale`.
- `relay_tcp_nat_data` adds a `window_remaining = guest_window - bytes_in_flight` gate; when zero, it breaks out (waits for guest ACK).
- `build_tcp_packet_static` signature now takes `(window_len, window_scale)`. The SYN-ACK passes `(65535, Some(7))`; data/ACK frames pass `(host_recv_window(fd), None)`.
- `host_recv_window(fd) -> u16` helper: one `getsockopt(IPPROTO_TCP, TCP_INFO, ...)` call, returns `tcpi_rcv_space >> 7` clamped to `u16::MAX`. Falls back to 32768 on syscall error.

Bench evidence — divan microbenches (vs current `main`)

Run via `scripts/bench-compare.sh --baseline origin/main --skip-vm`. Benches covered:

- `tcp_inbound_syn_ack_transition`
- `process_udp_frame`
- `port_forward_accept_latency`
- `dns_cache_hit`
- `nat_translate_outbound_hot_path`
- `flow_table_insert_remove/100`
- `process_syn`
- `tcp_inbound_syn_ack_transition` (`getsockopt(TCP_INFO)` cost)
- `tcp_bulk_throughput_1mb`
- `tcp_rx_latency_one_packet`
- `process_icmp_echo_request`
- `flow_table_insert_remove/1000`
- `poll_with_n_mixed_flows/999`
- `tcp_bulk_throughput_constrained_window/4096`
- `tcp_bulk_throughput_constrained_window/16384`
- `tcp_bulk_throughput_constrained_window/65536`

Wall-clock VM harness (`voidbox-network-bench`)

- `tcp_rr_latency_us_p50`
- `tcp_rr_latency_us_p99`
- `tcp_crr_latency_us_p50`
- `tcp_throughput_g2h_mbps`

The 3.4% g2h throughput regression appears to be the per-outgoing-frame `getsockopt(TCP_INFO)` syscall cost in `host_recv_window`. Profiling planned as a follow-up — candidate fixes: cache the value with a 1 ms TTL, or move the syscall onto the net-poll thread's housekeeping cadence so the data path uses a stale-but-recent value. The trade is worth it for now: correct backpressure is a correctness fix, not a perf trick. Phase 6.4 epoll dispatch absorbs the latency improvements (RR p50 -50%), so the net change vs pre-Phase-6.x `main` is heavily positive.

Snapshot interaction
Pre-6.3 snapshots restore cleanly: both new fields have `#[serde(default)]` and default to `(65535, 0)`, which is the pre-6.3 behavior (no scale, ignore guest window — same as if the entry was a Phase 6.0 entry). Verified via the existing `snapshot_integration` suite.

passt-comparison status
Documented as a deferred task in `docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md` ("passt head-to-head methodology"). Methodology agreed: same hardware, two-column report, focus on CRR latency (apples-to-apples since CRR is dominated by NAT-table ops, not MMIO exit overhead). Building the passt+qemu reference harness is a separate follow-up PR.

Commits (10)
Cherry-picked clean from `smoltcp-passt-port-phase6.3-window-mgmt` onto current `main` (post-#78):

1. docs: Phase 6.3 detailed TDD plan — TCP window management
2. feat(slirp): TcpNatEntry tracks guest_window + guest_window_scale
3. feat(slirp): parse guest's window_scale on SYN, store on flow
4. feat(slirp): track guest's advertised window on every incoming frame
5. refactor(slirp): build_tcp_packet_static takes (window_len, window_scale)
6. feat(slirp): advertise host-kernel-derived window on outgoing frames
7. test(network): pin tcp_advertised_window_tracks_guest_buffer (BROKEN_ON_PURPOSE)
8. feat(slirp): gate host→guest send on guest's advertised window — flips the BROKEN_ON_PURPOSE pin
9. test(network): pin tcp_window_scale_negotiated_in_synack
10. bench(network): tcp_bulk_throughput_constrained_window parametric

Test plan
- `cargo fmt --all -- --check` — clean
- `cargo clippy --workspace --all-targets --all-features -- -D warnings` — clean
- `RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features` — clean
- `cargo test --test network_baseline -- --test-threads=1` — 24/24 (was 22; +2 window pins)
- `cargo test --test network_baseline --features bench-helpers -- --test-threads=1` — 26/26
- `scripts/bench-compare.sh --baseline origin/main --skip-vm` — see table above
- `scripts/bench-compare.sh --baseline origin/main --skip-divan` (VM wall-clock) — see table above

Replaces draft #75
Same window-management content via the now-superseded #74 chain. Close #75 once this lands.
Follow-ups (not blocking this PR)
- `host_recv_window` perf: profile the +1.6% bulk regression; cache TCP_INFO with a short TTL or move into the housekeeping cadence.
- Flow-table cache-miss audit (`HashMap<FlowKey, FlowEntry>`) — separately tracked: data-path pollers do linear scans by `FlowKey` variant, which on a 1000-flow table at small entries is cache-unfriendly. Candidate: split into per-protocol maps or move to a small-vector for low-flow-count sandboxes.
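The per-protocol split could take roughly this shape — purely illustrative; `FlowKey` and `FlowEntry` here are hypothetical stand-ins for the real types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real key/entry types.
#[derive(Hash, PartialEq, Eq)]
enum FlowKey {
    Tcp(u16),
    Udp(u16),
}
struct FlowEntry {
    bytes: u64,
}

// Today: one unified map, so a TCP-only poller's linear scan still
// touches UDP entries' cache lines.
type UnifiedTable = HashMap<FlowKey, FlowEntry>;

// Candidate: per-protocol maps, so each data-path poller iterates only
// the entries of its own protocol.
#[derive(Default)]
struct SplitTable {
    tcp: HashMap<u16, FlowEntry>,
    udp: HashMap<u16, FlowEntry>,
}
```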